Abstract

We present a method to convert a digital single-lens-reflex (DSLR)
camera into a high resolution consumer depth and light field camera
by affixing an external aperture mask to the main lens. Compared to
existing consumer depth and light field cameras, our camera is easy
to construct with minimal additional cost, and our design is camera
and lens agnostic. The main advantage of our design is the ease of
switching between an SLR camera and a native resolution depth/light
field camera. Using an external mask is an important advantage over
current light field camera designs since we do not need to modify
the internals of the camera or the lens. Our camera sequentially
acquires the angular components of the light field of a static scene
by changing the location of the aperture in the mask. A consequence
of our design is that the external aperture causes heavy vignetting
in the acquired images. We calibrate the mask parameters and estimate
multi-view scene depth under vignetting. In addition to depth, we
show light field applications such as refocusing and defocus blur at
the sensor resolution.


1. Introduction

Consumer depth cameras using coded light [?] and time-of-flight [10]
have become extremely popular and have led to improved performance
in computer vision applications. These active depth acquisition
techniques, compared to passive techniques like stereo, are accurate
and robust in low light, but they are relatively expensive, consume
significant power and have limited range. As an alternative, light
field cameras are an emerging technology for capturing scene depth.
The increasing importance of this passive depth acquisition
technology is illustrated by the emergence of light field camera
companies like Lytro [?], Raytrix [?] and Pelican Imaging [?]. Light
field cameras are a generalization of stereo cameras and sample the
angular variation in the incident light field in addition to the
usual spatial variation. It has been shown recently in [?] that light
fields captured at wide baseline by SLR cameras can provide high
quality depth information. Though light field cameras have so far
been used primarily for consumer photography and scientific imaging,
their potential as an enabling technology for computer vision is
immense [?]. In addition to depth acquisition, the angular
information captured by light field cameras could improve many
computer vision problems such as segmentation, stabilization and
material classification. However, current light field cameras either
have poor spatial resolution [?] due to the spatio-angular tradeoff
or are significantly more expensive [?]. In this paper we propose a
method to convert a high spatial resolution DSLR camera into a native
resolution depth and light field camera.

Our new light field and depth camera is built with a DSLR camera and
an external aperture mask cut from black paper, as shown in Figure
[1]. The external mask affixed to the main camera lens acts as a
modulator, allowing only light rays within a small solid angle. We
capture angular information in the incident light field by changing
the mask sequentially, allowing a different set of solid angles at
each instance. Our light field camera is very easy to construct
(making paper aperture masks takes less than an hour) with minimal
marginal cost, provides the option of switching between a regular and
a light field camera, and provides high resolution depth.
Furthermore, our design altogether avoids accessing the internals of
the lens, redesigning the optics and redesigning the basic camera
processing such as demosaicing and color processing [20].

Our design is motivated by the ideas of programmable aperture [12],
external modulation [?] and mask-based modulation [?]. However, our
design is significantly easier to implement, uses no additional
optical elements and does not tinker with the lens system. This makes
our design particularly attractive as a consumer depth camera since
any existing DSLR camera can be converted into a light field camera
with high resolution depth with just an additional aperture mask. The
key insight of our paper is that it is not necessary to place the
mask in the aperture plane of the lens to capture angular information
of the light field. A similar mask affixed external to the lens can
capture the angular variation in the light field as well. Our
assumption is that the scene is significantly farther from the
aperture plane than the mask is, which is usually true except in
macro photography.

The placement of the mask in front of the lens, removed from the
aperture plane, causes the captured images to be heavily vignetted as
shown in Figure [1]. We explain the vignetting mathematically in
Section [3] and show that each image captures a sloped slice of the
light field as shown in Figure [2]. A programmable aperture camera
[12], on the other hand, captures a horizontal slice of the same
light field.

Figure 1. a) Mask setup: our setup with a DSLR camera and an external
paper mask. b) Captured images: acquired vignetted images
corresponding to different 5x5 mask sub-apertures. c) Refocusing:
examples of synthetic refocusing (zoom into the PDF to see the
out-of-focus areas in each image). d) All-focus image: an
all-in-focus image at the central sub-aperture. Notice the absence of
any vignetting. e) Recovered depth: estimated scene depth at the
central sub-aperture.

We use a mask with a 5 x 5 sub-aperture array to acquire the light
field of a static scene. From the 5 x 5 views we estimate multi-view
scene depth with occlusion reasoning. The depth estimation is
particularly hard since each captured image is vignetted and covers
only a limited part of the field-of-view (f.o.v.) of the lens. Our
problem is akin to the multi-view depth estimation problem described
by Kang et al. [8], with the constraint that we need to rely on
robust photoconsistency measures due to vignetting. Since no single
image has the full f.o.v., the depths estimated at each image must be
fused together to create a single depth image for the scene.

If the scene is nearly Lambertian, we can use the estimated depth and
occlusions to interpolate intermediate views between the captured
5x5 array of images and generate a finer sampling of the light field,
as shown in Section [5]. The captured vignetted images can then be
transformed into non-vignetted, all-focus images through a simple
resampling of the light field space. The finely sampled light field
also allows us to achieve alias-free digital refocusing.

In this paper we present traditional light field applications such as
depth estimation, refocusing and all-focus images, but the
information provided by light fields is much richer and goes beyond
imaging applications. We foresee light field data improving many
computer vision applications such as segmentation, stabilization,
material classification and recognition.

We note that currently our design is applicable only to static scenes
since we cycle through a 5x5 array of apertures. Further, our
reconstruction technique is computationally expensive since
multi-view passive stereo requires significant disparity search and
regularization. Nevertheless, we believe the advantages offered by
our simple design, its flexibility and its low marginal cost make
this approach exciting as a consumer depth camera. In summary, our
contributions are:

1. A consumer depth and light field camera design which can be built
easily and flexibly from a DSLR camera and an external paper mask
with little marginal cost.

2. Estimation of high resolution depth under vignetting and limited
field-of-view, enabling view interpolation and light field
reparameterization.

2.  Related  Work 

Depth cameras and reconstruction: Since light field cameras are a
generalization of stereo cameras, they inherit the advantages and
disadvantages of passive depth estimation techniques compared to
active methods such as coded light imaging and time-of-flight (ToF)
imaging, i.e. they involve little marginal cost, can work in bright
scenes, are not power hungry and have large range, but are
computationally expensive, perform poorly in low light and have low
depth sensitivity. But since our light field camera is built upon a
high resolution DSLR, it achieves high native spatial resolution,
unlike most consumer depth cameras. Recently, [?] showed that high
quality depth can be recovered from finely sampled light fields, but
the datasets were acquired over a significantly wide baseline and
required external camera calibration. Since our capture uses a single
center-of-projection, no external camera calibration is needed.

Our depth estimation procedure relies on constructing a disparity
space image (DSI) by reparameterizing the light field (refocusing)
and searching for the best focus or photoconsistency of pixels in the
array images. The stereo correspondence chapter in Szeliski's book
[?] provides an excellent overview of the different techniques
available for depth estimation. During multi-view depth estimation,
for each image view we warp the other views to this image and
estimate the depth by using a varying spatio-temporal window as
described in Kang et al. [8]. The occlusions are reasoned from the
depth estimates by checking for photoconsistency across views. For
spatial consistency of depth and to fill the holes in textureless
regions we use graph cut techniques based on MRFs [6]. Since the
angular dimension is not finely sampled with our 5x5 sub-aperture
mask, the epipolar plane image (EPI) based techniques for depth
estimation [4], [?] cannot be applied to our data.

Figure 2. a) The scene with axis $u_0$ and $u = \mu u_0$ in the
conjugate plane where $\mu$ is the spatial magnification, the axis
$s$ in the aperture plane, the axis $m_0$ in the external mask plane
and the corresponding axis $m$ in the virtual mask plane. The light
field is parameterized as $L(u, s)$ and the colored scene points have
corresponding colored support lines in b) and c). b) The modulation
by the mask with sub-aperture $k$ is shown by a sloped band with
slope $-\rho$. Since the mask sub-aperture band has a limited spatial
extent, the image $I_k(u)$ at the sensor is vignetted and has a
limited f.o.v. c) When the mask is in the aperture plane, $m = s$,
the band is horizontal and there is no vignetting.

Comparison with existing light field designs: Our light field camera
design is novel and builds on the theory and design principles laid
out by previous light field cameras. A good overview of the sampling
of the plenoptic function can be found in the surveys by Wetzstein et
al. [18] and Zhou et al. [22]. We use the terminology of Zhou et al.
[22] to classify the previous light field capture methods into three
classes: sensor side modulation, aperture modulation and object side
modulation.

Sensor side modulation: The basic idea of sensor-side modulation is
to project the angular information of the light field onto the
spatial dimension. This is accomplished by using either lenslet
arrays [?] or masks [?] in front of the sensor. These techniques
usually leave high-frequency patterns, making demosaicing and color
processing hard [20]. Furthermore, lenslets introduce optical
aberrations. More importantly, these techniques require significant
modifications to the hardware and offer no flexibility to switch
between capturing light fields and regular images. Our design
fundamentally avoids any internal access, is easy to construct and
flexible, and requires no reinvention of camera processing.

Aperture modulation: Modulation in the aperture plane [12] does not
project the angular dimension of the light field onto the sensor,
preserving the full resolution of the spatial dimension. Instead,
angular resolution is gained by sacrificing temporal resolution.
Levin et al. [?] proposed a coded aperture technique for depth
estimation but it was not used for light field capture. These
techniques require the lens body to be accessed to place the mask in
the optical pathway, since the aperture plane in a regular lens
system is inside the lens. Our design shows that the angular
dimension can be sampled even by placing the mask external to the
lens, in front of the camera, instead of in the aperture plane [12].
We note that the vignetting encountered in Liang et al. [12] is
primarily due to cosine falloff whereas the vignetting in our camera
is a consequence of our design.

Object side modulation: External modulation offers flexibility and
avoids reinventing camera processing. An example which avoids the
temporal tradeoff is the design of Georgiev et al. [7]. They use an
external concave lens array with prisms to achieve a spatio-angular
tradeoff by packing the angular information contiguously.
Nevertheless, their system requires careful engineering of the
external lens system and the additional relay lens can be bulky. The
additional optical elements also change the effective focal length of
the system and introduce aberrations.

The closest design to our camera is the lensless two-plane mask based
camera by Zomet and Nayar [23]. Their design allows wider
applications than light field capture but suffers in image quality
due to the lack of a lens [12]. On the other hand, our design
introduces no additional optical elements and can be built from
simple opaque paper with easy post-capture calibration. Our external
mask also allows the use of different aperture sizes and
configurations without modifying the effective focal length.

Multiple cameras in an array can be used to capture light fields with
a wide baseline (in addition to other applications), as demonstrated
by Stanford's camera array [?]. However, this system is expensive,
not portable and requires careful synchronization and calibration.

Parameterization and calibration: We parameterize the light field
with two planes, at the aperture and the sensor, as in the previous
mask based designs [?], [?]. We show that an external mask is
mathematically equivalent to a scaled and inverted internal mask
close to the aperture plane. Each image is a sloped slice of the
light field in the two-plane parameterization, and the slope is given
by the ratio of the mask-to-aperture distance and the mask-to-sensor
distance [?]. We calibrate the mask to determine the mask offset from
the principal point and the aperture plane axes.


3.  Mask  Modulation  and  Calibration 

We first present the basics of light fields and external mask
modulation with a 2D light field. Consider the scene shown in Figure
[2]a), where the camera is imaging the scene and a mask with a
sub-aperture is placed in front of the main lens. We parameterize the
light field external to the camera as $L_0(u_0, s)$ and the light
field internal to the camera as simply $L(u, s)$, where $s$ is the
axis at the aperture plane and $u_0$ and $u$ are axes in the
conjugate object and sensor planes respectively. The distances $d_0$
and $d$ of the planes $u_0$ and $u$ from the aperture plane $s$ are
related by the thin lens equation
$\frac{1}{d_0} + \frac{1}{d} = \frac{1}{f}$, where $f$ is the
effective focal length of the lens system. Since the sensor captures
a magnified (and inverted) image of the light field, we have
$u = \mu u_0$ where $\mu = -\frac{d}{d_0}$ is the spatial
magnification. This means that the internal light field is a scaled
and flipped version of the external light field. The image captured
at the sensor is an integration of the light field over the aperture
plane, i.e. $I(u) = \int_s L(u, s)\, ds$.
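As a quick numerical illustration of the thin lens relation and the magnification above, the following sketch solves for the sensor-side distance and the resulting $\mu$. The function names and the example values (a 50 mm lens focused at 2 m) are hypothetical and chosen only for illustration.

```python
def conjugate_sensor_distance(f, d0):
    """Solve the thin lens equation 1/d0 + 1/d = 1/f for the sensor-side distance d."""
    if d0 <= f:
        raise ValueError("object plane must lie beyond the focal length")
    return 1.0 / (1.0 / f - 1.0 / d0)


def magnification(f, d0):
    """Spatial magnification mu = -d / d0 (negative because the image is inverted)."""
    return -conjugate_sensor_distance(f, d0) / d0


if __name__ == "__main__":
    f, d0 = 50.0, 2000.0  # hypothetical values, in millimetres
    d = conjugate_sensor_distance(f, d0)
    mu = magnification(f, d0)
    print(f"d = {d:.3f} mm, mu = {mu:.5f}")
```

The magnitude of $\mu$ is small for distant scenes, which is consistent with the sensor seeing a strongly scaled-down, flipped copy of the external light field.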

The external mask is a 5 x 5 grid of sub-apertures attached to the
lens. We sequentially acquire 25 images by opening each of these
sub-apertures in turn. Note that a coded aperture [?] acquisition
would provide better noise properties but that is not the focus of
this paper. Like the light field, the external mask axis $m_0$ in the
mask plane has a virtual flipped and scaled mask axis $m$ in the
virtual mask plane inside the camera. Hence, we consider only the
light field inside the camera and investigate the effect of the
internal mask sub-aperture on the light field. Let the distance from
the aperture axis $s$ to the mask axis $m$ be $d_m$. We define the
ratio $\rho = \frac{d_m}{d - d_m}$, which relates the axes $s$, $m$
and $u$ as

$$s = -\rho u + m(1 + \rho). \qquad (1)$$
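Equation (1) can be checked with similar triangles. The short derivation below is a sketch under one assumption that the text does not spell out: the virtual mask plane lies between the aperture plane and the sensor plane, at distance $d_m$ from the aperture.

```latex
% A ray leaves the aperture plane at height s and reaches the sensor
% plane (distance d) at height u. At the virtual mask plane (distance
% d_m, assumed 0 <= d_m < d) its height m is a linear interpolation:
m = s + \frac{d_m}{d}\,(u - s)
\quad\Longrightarrow\quad
s\left(1 - \frac{d_m}{d}\right) = m - \frac{d_m}{d}\,u .
% Dividing by 1 - d_m/d = (d - d_m)/d gives
s = -\frac{d_m}{d - d_m}\,u + \frac{d}{d - d_m}\,m .
% With \rho = d_m/(d - d_m) we have 1 + \rho = d/(d - d_m), hence
s = -\rho\,u + (1 + \rho)\,m .
```

Under this assumption $\rho$ is the ratio of the mask-to-aperture distance to the mask-to-sensor distance, and $d_m = 0$ gives $\rho = 0$ and $s = m$.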

The mask sub-aperture modulates the light field and the modulation is
shown as a sloped band in the light field space with slope $-\rho$ in
Figure [2]b). The modulated light field is given by

$$L_k(u, s) = \int_{(k-0.5)\Delta}^{(k+0.5)\Delta} L(u, s)\, \delta(s + \rho u - (1 + \rho) m)\, dm. \qquad (2)$$

When the mask is in the aperture plane [?] as shown in Figure [2]c),
$\rho = 0$ and $s = m$.

The image captured with the $k$th sub-aperture open is
$I_k(u) = \int_s L_k(u, s)\, ds$. Since the modulation band does not
span the entire sensor range, the image $I_k(u)$ is vignetted and has
a limited f.o.v., as illustrated in Figure [2]. Note that as the
f-number of the camera increases, the f.o.v. decreases since the
modulation band has a smaller range in $s$, thus restricting the
range in $u$. As the sub-aperture $k$ changes, the modulation band
shifts in both $u$ and $s$, resulting in a shifted f.o.v. and a
parallax shift.
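The vignetting behaviour described above can be made concrete with a small 1D sketch. It assumes the relation $s = -\rho u + (1 + \rho) m$, a sub-aperture $k$ covering $m \in [(k - 0.5)\Delta, (k + 0.5)\Delta]$ on the virtual mask axis, and a lens aperture covering $s \in [-A/2, A/2]$. The function name, the symbol $A$ and the numeric values are hypothetical; the interval returned includes partially vignetted sensor positions (any position that receives some light).

```python
import math


def fov_interval(k, delta, rho, aperture):
    """Sensor interval [u_min, u_max] that receives light through sub-aperture k.

    Assumes s = -rho * u + (1 + rho) * m with m in [(k-0.5)*delta, (k+0.5)*delta]
    and s in [-aperture/2, aperture/2]. All lengths share one (arbitrary) unit.
    """
    if rho < 0:
        raise ValueError("this sketch only handles rho >= 0")
    if rho == 0:
        # Mask in the aperture plane: the band is horizontal, no vignetting.
        return (-math.inf, math.inf)
    m_lo = (k - 0.5) * delta
    m_hi = (k + 0.5) * delta
    half = aperture / 2.0
    # The s-range seen at u is [-rho*u + (1+rho)*m_lo, -rho*u + (1+rho)*m_hi];
    # it must overlap [-half, half] for the pixel to receive any light.
    u_min = ((1.0 + rho) * m_lo - half) / rho
    u_max = ((1.0 + rho) * m_hi + half) / rho
    return (u_min, u_max)


if __name__ == "__main__":
    for a in (4.0, 2.0):  # a smaller aperture corresponds to a larger f-number
        print("aperture", a, "->", fov_interval(0, 1.0, 0.5, a))
    print("k = 1 ->", fov_interval(1, 1.0, 0.5, 4.0))
```

In this sketch the interval width is $((1 + \rho)\Delta + A)/\rho$, so it shrinks as the aperture closes, and its center $(1 + \rho) k \Delta / \rho$ moves with $k$, matching the shifted f.o.v. described in the text.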

Consider the three colored points in the scene at different depths in
Figure [2]a). The support of the three points in the light field
space is given by the sloped lines. The point in focus has support
parallel to the $s$ axis. This implies that the integration of the
light field at the sensor induces no blur of the point. The point
farther from the camera has a negative slope and the point closer to
the camera has a positive slope, resulting in defocus blur in the
image. The slope of the lines gives the distance of the point from
the focal plane. Determining the depth of the points amounts to
estimating the slope of the lines in the light field space from the
vignetted images $I_k(u)$ and is discussed in Section [4].

Figure 3. Overlaid images captured at f/5.6 with the central
sub-aperture of the 5 x 5 mask and the sub-apertures above and next
to it. We calibrate the offset, orientation and the shift in f.o.v.
in pixels by localizing the blob centers.

In 4D, the light field is represented as $L(u, v, s, t)$ with the
mask plane defined by axes $m$ and $n$. For simplicity we choose the
aperture axes $s$, $t$ to align with the mask axes $m$, $n$, which
may not be aligned with the sensor axes $u$, $v$.

Mask calibration: We need to calibrate the mask so that we can
determine the axes of the sub-apertures. In practice, the 2D mask
center can be offset from the principal point and the mask axes $m$
and $n$ may not align with the image axes $u$ and $v$. We capture
calibration images of a diffuse white board to estimate the offset of
the mask center, the shift in f.o.v. in pixels and the mask axis
rotation. We pick f/5.6 and photograph the white board with mask
locations at the center and one each along the axes, as shown in
Figure [3], to localize the vignetted image centers accurately. The
center image gives the offset from the principal point. The images
along the different axes give the direction of the mask axes with
respect to the sensor axes, and the displacement in pixels along each
axis gives the f.o.v. shift in pixels. The offset, f.o.v. shift and
axis rotation allow us to achieve physically accurate refocusing and
depth estimation.
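The calibration steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the vignetted white-board images appear as bright blobs, localizes each blob by an intensity-weighted centroid of thresholded pixels, and takes the principal point as a given input. The helper names (`blob_center`, `calibrate_mask`, `synthetic_blob`) and the threshold are hypothetical.

```python
import numpy as np


def blob_center(img, rel_thresh=0.5):
    """Intensity-weighted centroid (x, y) of pixels above rel_thresh * max."""
    img = np.asarray(img, dtype=float)
    peak = img.max()
    if peak <= 0:
        raise ValueError("image contains no bright blob")
    ys, xs = np.nonzero(img >= rel_thresh * peak)
    w = img[ys, xs]
    return np.array([np.sum(xs * w), np.sum(ys * w)]) / np.sum(w)


def calibrate_mask(center_img, axis_img, principal_point):
    """Return (offset, shift_px, angle) from two white-board captures.

    offset   : blob center of the central sub-aperture minus the principal point
    shift_px : f.o.v. shift in pixels between neighboring sub-apertures
    angle    : rotation (radians) of the mask axis relative to the sensor x axis
    """
    c0 = blob_center(center_img)
    c1 = blob_center(axis_img)
    offset = c0 - np.asarray(principal_point, dtype=float)
    step = c1 - c0
    shift_px = float(np.hypot(step[0], step[1]))
    angle = float(np.arctan2(step[1], step[0]))
    return offset, shift_px, angle


def synthetic_blob(shape, center_xy, radius):
    """Toy calibration image: a uniform bright disk on a black background."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    inside = (xs - center_xy[0]) ** 2 + (ys - center_xy[1]) ** 2 <= radius ** 2
    return inside.astype(float)


if __name__ == "__main__":
    center_img = synthetic_blob((80, 100), (50, 40), 10)
    axis_img = synthetic_blob((80, 100), (62, 45), 10)
    print(calibrate_mask(center_img, axis_img, principal_point=(49.5, 39.5)))
```

Repeating `calibrate_mask` with the sub-aperture along the second mask axis would give the corresponding shift and direction for that axis.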

4.  Depth  Estimation 

Depth estimation of the scene underlies many of the applications of
light fields such as view interpolation, alias-free refocusing and
multi-view image fusion. But estimating the depth from vignetted
images acquired with an external mask is challenging. In this section
we first briefly describe the previous approaches to light field
depth estimation. We then pose the problem of estimating scene depth
as that of aligning the sensor axis with the light field gradient.
Based on this formulation we present our approach to multi-view depth
estimation and depth fusion under vignetting.

Previous light field depth estimation: The depth of scene points from
a 4D light field $L(u, v, s, t)$ can be determined by estimating the
gradient of their support in the epipolar plane images (EPIs)
$L(u, s)$ and $L(v, t)$ as described in [4]. This approach was
improved by Wanner et al. [?] by further reasoning about occlusions
in the EPI using a structure tensor framework. Both methods were
designed for finely sampled angular dimensions $(s, t)$. When the
angular samples are limited, the depth of the scene points is
estimated through traditional stereo matching techniques. Liang et
al. [12] perform multi-view depth estimation at each of the
sub-aperture images $I_k(u)$, and occlusion is reasoned between every
pair of neighboring views.

Figure 4. a) The scene with blue and red occluding lines. To estimate
depth we reparameterize the sensor plane by a factor $\alpha$ to
$u'$. The light field of the lines is shown as shaded overlapping
blue and red regions in b) and c). b) Not aligned: the images at
sub-apertures $k - 1$, $k$ and $k + 1$ show the right edges of the
blue regions not aligned and progressively occluding the red region.
c) Aligned: after reparameterization by a factor $\alpha > 1$, the
new sensor plane $u'$ is perpendicular to the blue border in
$L(u', s)$. The right edges of the blue region are now aligned in the
warped image. Depth estimation is simply searching for the $\alpha$
which aligns the image intensities in the warped images.

In this paper, we have access to $L(u, m)$ and not $L(u, s)$. Hence
we estimate depth with $L(u, m)$ and explain our method in terms of
$L(u, m)$ as well.

Depth estimation as light field alignment: In multi-view stereo, the
depth of the scene at a reference view is estimated by warping the
other views to that view and checking for photoconsistency of the
scene points. The warping is a homography transformation
corresponding to a virtual scene depth [8]. In light fields, the
homography transformation corresponding to a virtual depth is simply
a reparameterization to a virtual sensor plane $u'$ as shown in
Figure [4]b) and [4]c):

$$L_\alpha(u', m) = L(\alpha u' + (1 - \alpha) m, m). \qquad (3)$$

The reparameterization factor $\alpha$ indicates the scene depth:
$\alpha > 1$ corresponds to moving $u'$ away from the aperture plane,
bringing the virtual depth closer, and $\alpha < 1$ corresponds to
moving the virtual depth farther. When the light field is
reparameterized, the images $I_k(u)$ are warped to $I_k^\alpha(u')$
as illustrated in Figure [4]. In Figure [4]b), the red and blue lines
in the scene correspond to regions which intersect in the light field
$L(u, s)$. As the sub-aperture $k$ is changed, the blue and red
regions in the image $I_k(u)$ move closer to each other, with the
blue region finally occluding the red region. The right edge of the
blue region in image $I_k(u)$ also shifts right. When the light field
is reparameterized as shown in Figure [4]c), the right edges of the
blue region across different views align. Since blue is closer to the
camera, the factor $\alpha > 1$. This corresponds to moving the
sensor axis $u'$ away and rotating the light field anti-clockwise.

In other words, we search for the $\alpha$ which makes the axis $u'$
perpendicular to the blue line in $L_\alpha(u', m)$. Likewise, we
search for the $\alpha$ which makes the red line perpendicular to
$u'$.
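The alignment idea can be illustrated in 1D. The sketch below assumes the reparameterization $L_\alpha(u', m) = L(\alpha u' + (1 - \alpha) m, m)$, so the image of sub-aperture $k$ at mask position $m_k$ is resampled at $\alpha u' + (1 - \alpha) m_k$. The toy scene (a single Gaussian feature whose position shifts linearly with $m_k$) and all names and values are hypothetical; in this toy model the feature aligns across views when $\alpha = 1 - c$, where $c$ is the assumed parallax per unit of mask displacement.

```python
import numpy as np


def reparameterize(row, u, m_k, alpha):
    """Resample one 1D sub-aperture image at alpha * u' + (1 - alpha) * m_k.

    u holds the sensor coordinates of the samples in row; the output is defined
    on the same grid (now read as u'). Samples that fall outside the captured
    range are set to NaN.
    """
    u = np.asarray(u, dtype=float)
    src = alpha * u + (1.0 - alpha) * m_k
    return np.interp(src, u, np.asarray(row, dtype=float), left=np.nan, right=np.nan)


def demo_views(u, mask_positions, u0=4.0, c=0.2, sigma=0.5):
    """Toy views: one Gaussian feature located at u0 + c * m_k in view k."""
    return [np.exp(-0.5 * ((u - (u0 + c * m)) / sigma) ** 2) for m in mask_positions]


if __name__ == "__main__":
    u = np.linspace(-20.0, 20.0, 401)
    ms = [-2.0, 0.0, 2.0]
    views = demo_views(u, ms)
    for alpha in (1.0, 0.8):
        warped = [reparameterize(v, u, m, alpha) for v, m in zip(views, ms)]
        peaks = [float(u[int(np.nanargmax(w))]) for w in warped]
        print("alpha =", alpha, "feature positions:", peaks)
```

With `alpha = 1.0` the feature sits at a different position in each view; with `alpha = 0.8` the three warped views place it at the same $u'$, which is the alignment condition the depth search looks for.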

Multi-view stereo under vignetting: In our camera, the sub-aperture
images $I_k(u)$ are vignetted and have a limited f.o.v. of the scene.
We describe the multi-view depth estimation under vignetting in
Figure [5]. Since the vignetted pixels are unreliable, we do not use
those pixels for depth estimation and the depth is estimated only at
the non-vignetted pixels. All masked images $I_k(u)$ are warped (an
affine transformation) to $I_k^\alpha(u')$ by a factor $\alpha$
corresponding to a virtual depth. Note that the pixel $u'$ in
$I_{k_r}^\alpha(u')$ changes with changing $\alpha$. To estimate the
scene depth at a reference view $k_r$, we apply another affine
transformation to all views to ensure that
$I_{k_r}^\alpha(u') = I_{k_r}(u)$ for every $\alpha$. This additional
warping is only for practical purposes and helps avoid the problem of
tracking pixels $u$ of $I_{k_r}$ across different $\alpha$.

Next, we construct a disparity space image (DSI)
$D(u, \alpha, k_r)$ [15] to estimate the depth at the view $k_r$. The
DSI at pixel $u$ of sub-aperture image $k_r$ quantifies the
photoconsistency of the other views at that pixel when the views are
warped by factor $\alpha$. The DSI is constructed as

$$D(u, \alpha, k_r) = \sum_k f(I_{k_r}^\alpha(u), I_k^\alpha(u)). \qquad (4)$$

The function $f(\cdot)$ is a robust measure of photoconsistency which
measures the difference in color as well as image gradients at the
pixel across views and is given by

$$f(I_1(u), I_2(u)) = (1 - \lambda)\,|I_1(u) - I_2(u)| + \lambda\,|\nabla_u I_1(u) - \nabla_u I_2(u)|. \qquad (5)$$

In Equation ([4]), temporal selection [?] is done at each pixel to
weed out poor view matches (such as vignetted regions of some views)
and remove their contribution to the DSI.
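One slice of the DSI (a single value of $\alpha$) can be sketched in 1D as follows. The cost mixes intensity and gradient differences with a weight `lam`, as in the photoconsistency measure above. The temporal selection rule used here (keep the best fraction `keep` of the views at each pixel, treat NaN as vignetted) and the averaging over the kept views instead of a plain sum are assumptions made for this illustration; the text does not specify them.

```python
import numpy as np


def photo_cost(i1, i2, lam=0.5):
    """Per-pixel cost (1 - lam) * |I1 - I2| + lam * |dI1/du - dI2/du|."""
    g1 = np.gradient(i1)
    g2 = np.gradient(i2)
    return (1.0 - lam) * np.abs(i1 - i2) + lam * np.abs(g1 - g2)


def dsi_slice(warped, kr, lam=0.5, keep=0.5):
    """DSI values D(u, alpha, kr) for views already warped by one alpha.

    warped : array of shape (K, N); NaN marks vignetted (unusable) pixels.
    keep   : fraction of the other views retained per pixel (assumed rule).
    Returns an array of length N; pixels with no valid match get +inf.
    """
    warped = np.asarray(warped, dtype=float)
    ref = warped[kr]
    others = [v for i, v in enumerate(warped) if i != kr]
    costs = np.stack([photo_cost(ref, v, lam) for v in others])
    costs = np.where(np.isnan(costs), np.inf, costs)  # invalid matches sort last
    costs = np.sort(costs, axis=0)
    n_keep = max(1, int(round(keep * len(others))))
    best = costs[:n_keep]
    valid = np.isfinite(best)
    total = np.where(valid, best, 0.0).sum(axis=0)
    count = valid.sum(axis=0)
    return np.where(count > 0, total / np.maximum(count, 1), np.inf)


if __name__ == "__main__":
    u = np.linspace(0.0, 1.0, 11)
    ref = np.sin(u)
    views = np.stack([ref, ref.copy(), ref + 1.0])  # third view matches poorly
    print(dsi_slice(views, kr=0, keep=0.5))  # the poor view is rejected
    print(dsi_slice(views, kr=0, keep=1.0))  # the poor view contributes
```

Stacking such slices over a set of candidate $\alpha$ values gives the full $D(u, \alpha, k_r)$ volume for one reference view.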

Figure 5. We estimate the scene depth at a reference view $k_r$ by
checking other warped views for photoconsistency. In each image we
only consider the non-vignetted regions for depth estimation. The
masked raw images are first warped by a factor $\alpha$ corresponding
to the virtual sensor position $u'$. These images are further affine
transformed to ensure that the pixel $u' = u$ in $I_{k_r}$ for every
$\alpha$. Then the image view $k$ is compared with $k_r$ for
photoconsistency. The resulting depth $d_{k_r}(u)$ is estimated only
at the non-vignetted pixels and is regularized with an MRF to fill
holes.


Figure 6. Panels: depths from multiple views; fused depth. To
generate the full depth at a reference view we warp the depth at
other views to the reference view and perform visibility reasoning to
handle occlusions.


The depth $d_{k_r}(u)$ at the reference view $k_r$ is then given by

$$d_{k_r}(u) = \arg\min_\alpha D(u, \alpha, k_r). \qquad (6)$$

We then employ a standard MRF based depth estimation [?] to ensure
spatial consistency and to fill holes in the smooth regions. The
unary potential at pixel $u$ is given by

$$E(u) = D(u, \alpha_u), \qquad (7)$$

and the smoothness constraint between neighboring pixels $u_1$ and
$u_2$ is


E(u1,u2) 


1 


max(|aUl 


-aU2\,a)  (8) 


Multi-view depth fusion: The depth $d_{k_r}(u)$ estimated at the
reference view $k_r$ is limited to non-vignetted pixels and
corresponds to only a fraction of the f.o.v. of the scene. But
different views $k$ have different f.o.v. regions. Hence we use the
depth of the scene points from other views to complete the depth
information at $k_r$. The procedure shown in Figure [6] warps the
depth $d_k(u)$ at pixel $u$ to the view $k_r$ by an affine
transformation corresponding to the virtual scene depth $d_k(u)$.
Note that some scene points occluded in view $k_r$ will be visible in
other views. This causes conflicts in the depth estimates in the
occluding regions when other views are warped to $k_r$. We resolve
the conflict in such regions by performing visibility reasoning, i.e.
we simply take the minimum of all warped depths. The combined depth
from different views at the central view is shown in Figure [6].
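The final fusion step can be sketched as a per-pixel minimum. The sketch assumes the per-view depth maps have already been warped into the reference view, that pixels without an estimate (vignetted or outside a view's f.o.v.) are marked NaN, and that smaller values denote closer surfaces; the function name and this encoding are assumptions, not details given in the text.

```python
import numpy as np


def fuse_depths(warped_depths):
    """Fuse depth maps already warped to the reference view.

    warped_depths : array of shape (K, ...) with NaN where a view has no estimate.
    Returns the per-pixel minimum over the views that have an estimate, and NaN
    where no view contributes.
    """
    d = np.asarray(warped_depths, dtype=float)
    filled = np.where(np.isnan(d), np.inf, d)  # missing estimates never win
    fused = filled.min(axis=0)
    return np.where(np.isinf(fused), np.nan, fused)


if __name__ == "__main__":
    views = [[1.0, np.nan, 3.0, np.nan],
             [2.0, 5.0, np.nan, np.nan]]
    print(fuse_depths(views))
```

Pixels covered by several views take the smallest warped depth, while pixels covered by a single view simply inherit that view's estimate, which is how the limited f.o.v. of each view gets filled in.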

5.  Applications 

High spatial resolution depth and light fields are a rich source of
information about the plenoptic function and potentially useful for
many computer vision applications such as segmentation, stabilization
and recognition [?]. In this paper we restrict our focus to light
field imaging applications [?] and hope the emergence of light field
cameras will spur research into their use in computer vision
applications as well. The estimation of dense scene depth at the
sensor resolution allows us to implement the standard light field
applications [?] despite the lower angular resolution of our captured
light field. We use the depth of the scene points to fuse images from
multiple views to achieve an all-in-focus image. Multi-view depth
information also makes occlusion reasoning easy, enabling view
interpolation between the sub-aperture views. The interpolated views
allow us to overcome the aliasing in the angular dimension, enabling
alias-free refocusing.

All-in-focus images: Examples of all-in-focus images are
shown in Figure 1 and Figure 9. Since each view is
vignetted, the all-in-focus image is created by borrowing
pixels from different views. Using the depth information at
the source view, we determine the amount of warp needed
to transform the source image to the reference view. But a
naive warp of the images to the reference view will cause
tearing artifacts. To prevent that, we reason about the
pixels which will be occluded in the reference view. The depth
information at both these views allows us to determine the
occlusions and disocclusions through visibility reasoning.
We use the estimated occlusion map along with the required
warp to propagate pixel values to the reference view, creating
a seamless all-in-focus image.

Figure 7. The view k = (2.5, 2.5) has been interpolated from
the views k = (2,2) and k = (3,3) and visualized through the
difference images. White areas denote the least difference and
colored areas have high difference. Notice that the difference is
larger only at the farther and nearer ends of the scene where the
motion is largest.

[Figure 8 panels: Comparison at red window; Comparison at green window]

Figure 8. Digital refocusing done at two different scene depths.
Notice that the refocused images are not aliased despite the low
angular resolution since we integrate over the interpolated views.
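
As a rough illustration of how pixels can be borrowed from several vignetted views for the all-in-focus image, the sketch below composites views that are assumed to be already warped to the reference view. The validity masks (non-vignetted and not occluded) are taken as given inputs, and averaging the valid contributions is our own simplifying assumption; the text does not specify how overlapping pixels are combined. The name `composite_all_in_focus` is hypothetical.

```python
import numpy as np

def composite_all_in_focus(warped_views, valid_masks):
    """Combine views already warped to the reference view.

    warped_views: (K, H, W) intensities after warping.
    valid_masks:  (K, H, W) booleans, True where a view supplies a usable
                  pixel (assumed to encode vignetting and occlusion).
    Returns an (H, W) image averaging the valid contributions, with NaN
    where no view covers the pixel.
    """
    views = np.asarray(warped_views, dtype=float)
    masks = np.asarray(valid_masks, dtype=bool)
    count = masks.sum(axis=0)                        # views covering each pixel
    total = np.where(masks, views, 0.0).sum(axis=0)  # sum of valid intensities
    out = np.full(views.shape[1:], np.nan)
    np.divide(total, count, out=out, where=count > 0)
    return out

views = [[[1.0, 2.0, 7.0]], [[3.0, 9.0, 7.0]]]
masks = [[[True, True, False]], [[True, False, False]]]
print(composite_all_in_focus(views, masks))
```

Pixels covered by a single view are copied unchanged, so the vignetted border of one view is filled by whichever other views see it.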

View Interpolation: Given the all-in-focus images at
two neighboring views, we interpolate the intermediate
views. Figure 7 shows the interpolated view k =
(2.5, 2.5) between the sub-aperture views k = (2,2) and
k = (3,3). The interpolated image has been visualized
through difference images. Notice that the difference is
larger at the farther and closer parts of the scene where the
motion is largest. We note that the knowledge of depth
allows alias-free interpolation, in contrast to the ghosting seen
with simple alpha blending. We discuss this further in the
supplementary material and also provide videos of smooth
transitions in viewpoint along the interpolated views.
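
One common way to use depth for view interpolation is a forward warp with a z-buffer, sketched below. This is a generic technique shown under stated assumptions and is not necessarily the procedure used here: `disparity` is assumed to hold the per-pixel shift (in pixels) for a full step between the two neighboring views, a larger signed disparity is assumed to mean a nearer point, and shifts are rounded to whole pixels. The function `forward_warp` and its parameters are hypothetical.

```python
import numpy as np

def forward_warp(image, disparity, t, direction=(1.0, 1.0)):
    """Warp `image` a fraction t (0..1) of the way to a neighboring view.

    disparity: (H, W) per-pixel shift for a full step between the views
               (assumed convention: larger value = nearer to the camera).
    direction: (dy, dx) step between the two views in aperture coordinates.
    Returns the warped image; NaN marks holes (disocclusions).
    """
    h, w = image.shape
    out = np.full((h, w), np.nan)
    zbuf = np.full((h, w), -np.inf)   # best disparity seen at each target
    ys, xs = np.mgrid[0:h, 0:w]
    ty = np.rint(ys + t * disparity * direction[0]).astype(int)
    tx = np.rint(xs + t * disparity * direction[1]).astype(int)
    inside = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    for y, x in zip(ys[inside], xs[inside]):
        yy, xx = ty[y, x], tx[y, x]
        # Visibility test: a nearer pixel overwrites a farther one.
        if disparity[y, x] > zbuf[yy, xx]:
            zbuf[yy, xx] = disparity[y, x]
            out[yy, xx] = image[y, x]
    return out

img = np.array([[10.0, 20.0, 30.0, 40.0]])
disp = np.array([[0.0, 0.0, 2.0, 0.0]])
print(forward_warp(img, disp, t=0.5, direction=(0.0, 1.0)))
```

In the toy example the nearer pixel (value 30) moves one pixel to the right and occludes the background pixel there, leaving a hole behind it. In practice such holes would be filled by warping the other neighboring view toward the same intermediate position, whereas alpha blending the two unwarped images would superimpose the shifted copies and produce ghosting.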

Refocusing: We refocus at different depths of the scene by
warping the light field to the virtual sensor position u' and
then integrating over the synthetic aperture window W:

\sum_{m \in W} L_a(u', m).  (9)

In Figure 8 we show the refocusing at two different scene
depths. Notice that the refocused images have no aliasing despite
the limited angular resolution of our camera since we integrate
over the interpolated views.
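
The integration over the synthetic aperture can be illustrated with a basic shift-and-add sketch. It rests on several assumptions of our own: each view is warped by a pure translation proportional to its aperture offset (the parameter `shift_per_view` selects the focal depth), shifts are rounded to whole pixels, and the sum is normalized by the number of views. Note also that `np.roll` wraps pixels around the image border, which a real implementation would mask out. The function `refocus` is hypothetical.

```python
import numpy as np

def refocus(light_field, positions, shift_per_view, window=None):
    """Shift-and-add refocusing over a synthetic aperture window.

    light_field:    (K, H, W) stack of sub-aperture (or interpolated) views.
    positions:      (K, 2) aperture coordinates (my, mx) relative to the
                    reference view.
    shift_per_view: pixel shift per unit aperture offset (assumed
                    parameterization of the focal depth).
    window:         indices of the views to integrate over; all if None.
    """
    lf = np.asarray(light_field, dtype=float)
    pos = np.asarray(positions, dtype=float)
    indices = range(len(lf)) if window is None else window
    acc = np.zeros(lf.shape[1:])
    n = 0
    for k in indices:
        dy, dx = (int(v) for v in np.rint(shift_per_view * pos[k]))
        # Translate view k so points at the chosen depth line up.
        acc += np.roll(lf[k], shift=(dy, dx), axis=(0, 1))
        n += 1
    return acc / n

lf = [[[0.0, 1.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 1.0]]]
pos = [[0.0, -1.0], [0.0, 1.0]]
print(refocus(lf, pos, shift_per_view=-1.0))  # point aligned: in focus
print(refocus(lf, pos, shift_per_view=0.0))   # point doubled: out of focus
```

With only two views the out-of-focus result shows two distinct copies of the point, which is the angular aliasing mentioned in the text; adding interpolated views to the window spreads the copies into a smooth blur.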

6.  Conclusions 

We presented a novel consumer depth and light field
camera built with a DSLR camera and an external mask.
The key feature of our design is the ability to convert any
camera into a light field camera at will and extract high
resolution depth with minimal marginal costs. We hope that
this design will be a starting point for further investigation
into simple, easy-to-build consumer depth and light field
capture devices which can be used for solving a wide range
of computer vision problems. The angular samples
of the light field, in addition to high resolution
depth, provide additional information to improve the
quality of vision applications such as segmentation, tracking
and classification, and we hope to explore this in future
work. In this paper we demonstrated that our design allows
acquisition of quality depth information even with the small
baseline of the lens aperture, enabling imaging applications
such as refocusing and multi-view all-in-focus images.

Our method currently captures the light field of a static
scene at full sensor resolution. Since we capture the different
sub-apertures sequentially, we trade off temporal resolution
to gain angular resolution. We adopted a multi-view
stereo reconstruction framework for depth estimation due
to the limited f.o.v. and low angular resolution of the acquired
light field. This makes our depth estimation computationally
expensive and hence our reconstruction is not real-time.
The design of a sequence of external mask patterns which
makes the acquisition fast and the exploration of fast
multi-view depth algorithms are avenues for future work.